... problem which involves a fairly simple form of decision making
(classifying) and a special type of algorithm for learning to solve it.
The problem is that of classifying correctly individuals which are drawn
at random from a population which is partitioned into a finite number of
categories. The learning process is required to be a step-by-step
procedure in which observations are made on individuals one at a time and
the current estimate of the required partitioning may be adjusted after
each observation on the basis of knowledge of the category in which the
individual observed falls. At any given time, the current estimate of the
partitioning is all that is held in memory; past history is lost except
insofar as it has been incorporated into the present estimate. The
learning process of perceptrons, as well as that of other artificial
intelligences, is of this general form.

Each individual is characterized by a vector in n-dimensional Euclidean
space. I shall assume that this characterization is sufficiently rich
with respect to the given classification problem, by which I mean the
following. If $S_i$ is the smallest closed convex set which contains all
the vectors which describe individuals of the $i$th category, then the
intersection $S_i \cap S_j$ of any two such convex sets is empty.

This terminology is appropriate to situations for which, in the case of
failure of the condition of sufficient richness, a re-examination of the
world of individuals and the subsequent increasing of the number of
components of the characterizing vectors can be expected to yield a new
characterization for which this condition is satisfied. The question of
whether, in a particular case, a sufficiently rich characterization can
be achieved is obviously crucial but beyond the scope of this paper. Many
problems are ruled out by this requirement, including those for which the
noise level of the measuring instruments is too high or for which the
very act of taking the measurement changes the category of the
individual, as well as those which involve relations which are
essentially non-linear.

2. Notation and Assumptions. I shall follow the convention of using
upper case letters to denote vectors or sets of vectors and lower case
letters to denote scalars. In the argument below, each individual, which
is characterized by a vector $X$ in n-dimensional Euclidean space, is a
member of one and only one of two categories. The results obtained are
applicable to the general case, however, for they may be applied to
appropriate partitions of a set of three or more categories into two
subsets. I make the following assumptions:

(1) $X > 0$.

(2) $0 < h_1 < |X| < h_2 < \infty$.

(3) There exists a plane $B^* X = 1$ and a positive number $c^*$ such
that if $X \in S_1$, $B^* X > 1 + c^*$, and if $X \in S_2$,
$B^* X < 1 - c^*$ (obviously, if one such plane exists, so do an
infinity of planes).
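
As a rough illustration, the three assumptions can be checked on finite samples in the manner of the following Python sketch. The function name `check_assumptions`, the sample points, and the reading of assumption (1) as "every component is positive" are illustrative assumptions and are not part of the argument of this paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def check_assumptions(s1, s2, b_star, c_star, h1, h2):
    """Check assumptions (1)-(3) on finite samples s1 (from S_1) and s2 (from S_2).

    Assumption (1) is read here as "every component is positive",
    which is an interpretation, not something the text spells out.
    """
    points = list(s1) + list(s2)
    positive = all(x > 0 for p in points for x in p)                # (1)
    bounded = all(h1 < math.sqrt(dot(p, p)) < h2 for p in points)   # (2)
    above = all(dot(b_star, p) > 1 + c_star for p in s1)            # (3), S_1 side
    below = all(dot(b_star, p) < 1 - c_star for p in s2)            # (3), S_2 side
    return positive and bounded and above and below


# Hypothetical two-dimensional data, chosen only for illustration.
S1 = [(2.0, 2.0), (3.0, 1.5)]
S2 = [(0.5, 0.4), (0.3, 0.6)]
B_STAR = (0.5, 0.5)

print(check_assumptions(S1, S2, B_STAR, c_star=0.2, h1=0.5, h2=4.0))  # prints True
```

With these hypothetical points the margin $c^* = 0.2$ is satisfied; a larger margin such as 0.6 would fail on the $S_2$ side.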

The problem is to estimate a vector $B^*$ sequentially so that each time
a vector $X$ is observed, the current estimate of $B^*$ is subject to
revision in accordance with a rule which depends upon knowledge of
whether the vector $X$ is in $S_1$ or in $S_2$. This rule will be
described in Section 3, and its convergence properties will be discussed
in Section 4.

The case where the dividing plane passes through the origin and the
vectors are binary was analyzed by Papert [1], and the algorithm in
Section 3 is a natural extension of the one in [1]. The results in this
paper are relevant to the case where it is not known that a plane which
separates the regions passes through a particular point. If such a point
is known, then a translation which moves it to the origin will allow use
of the algorithm in [1].

3. The Algorithm. It is assumed that sampling is random and that
initially there are two samples: $X_1^1, \ldots, X_1^p$ from $S_1$ and
$X_2^1, \ldots, X_2^q$ from $S_2$. Let $X^t$ be the $t$th vector sampled
after the initial $p + q$ vectors. Estimate $B^*$ as follows:

(1) Let the initial estimate of $B^*$ be $B^1$, any vector for which the
plane $B^1 X = 1$ separates the initial $p + q$ vectors so that
$B^1 X_1^d > 1$, for $d = 1, \ldots, p$, and $B^1 X_2^d < 1$, for
$d = 1, \ldots, q$.

(2) Obtain the $(t+1)$st estimate from the $t$th estimate from the
equation

    $B^{t+1} = B^t + e_1^t X^t - e_2^t X^t$,

where the $e$'s are determined in accordance with the following rules:

(a) If $X^t \in S_1$, then $e_2^t = 0$;
    $e_1^t = 0$, if $B^t X^t > 1$;
    and $e_1^t = (1 - B^t X^t) / |X^t|^2 + o(1/t^2)$, if $B^t X^t < 1$.

(b) If $X^t \in S_2$, then $e_1^t = 0$;
    $e_2^t = 0$, if $B^t X^t < 1$;
    and $e_2^t = (B^t X^t - 1) / |X^t|^2 + o(1/t^2)$, if $B^t X^t > 1$.

(c) $o(1/t^2)$ is positive and sufficiently small so that the addition
(case a) or subtraction (case b) of the term $o(1/t^2) X^t$ does not
change the sign of any component of $B^{t+1}$.
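
A single application of step (2) might be sketched in Python as follows. This is only an illustration under stated assumptions: the names `update` and `slack` are hypothetical, the choice $10^{-3}/t^3$ is merely one positive function that is $o(1/t^2)$, and the sign condition (c) is not enforced.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def slack(t):
    """A positive term that is o(1/t^2); this particular choice is an assumption."""
    return 1e-3 / t ** 3


def update(b, x, category, t):
    """Apply step (2) once and return (new_b, e1, e2).

    category is 1 if x is known to lie in S_1 and 2 if it lies in S_2.
    Condition (c) on the signs of the components is not enforced here.
    """
    a = dot(b, x)        # B^t X^t
    n2 = dot(x, x)       # |X^t|^2
    e1 = e2 = 0.0
    if category == 1 and a < 1:       # rule (a): X is on the wrong side
        e1 = (1 - a) / n2 + slack(t)
    elif category == 2 and a > 1:     # rule (b): X is on the wrong side
        e2 = (a - 1) / n2 + slack(t)
    new_b = [bi + (e1 - e2) * xi for bi, xi in zip(b, x)]
    return new_b, e1, e2


# Hypothetical usage: one correction for a vector known to be in S_1.
b_next, e1, e2 = update([0.1, 0.1], [1.0, 2.0], category=1, t=1)
print(dot(b_next, [1.0, 2.0]) > 1)  # prints True
```

After a correction the observed vector lies just beyond the new plane, since the new value of $B X$ is $1 + o(1/t^2) |X|^2$ in case (a) and $1 - o(1/t^2) |X|^2$ in case (b).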


4. Convergence Properties of the Algorithm. The vector $B^t$ may be
written:

    $B^t = B^1 + (e_1^1 X^1 + \cdots + e_1^{t-1} X^{t-1}) - (e_2^1 X^1 + \cdots + e_2^{t-1} X^{t-1})$

    $\phantom{B^t} = B^1 + r_1^t Z_1^t - r_2^t Z_2^t$,

where $Z_i^t = (e_i^1 X^1 + \cdots + e_i^{t-1} X^{t-1}) / r_i^t$
and $r_i^t = e_i^1 + \cdots + e_i^{t-1}$, for $i = 1, 2$.
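
The decomposition can be checked numerically with a sketch such as the one below, which tracks $r_i^t$ and $Z_i^t$ alongside $B^t$ and keeps each $Z_i^t$ as a running weighted mean of the corrected vectors. The names `run_with_decomposition` and `reconstruct`, the data, and the stand-in $10^{-3}/t^3$ for the $o(1/t^2)$ term are assumptions made for illustration; the starting vector is chosen only so that both kinds of correction occur.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def run_with_decomposition(b1, samples):
    """Run the update rule and track r_i^t and Z_i^t alongside B^t.

    samples is a list of (x, category) pairs with category 1 or 2.
    The slack 1e-3 / t**3 stands in for the o(1/t^2) term (an assumption).
    """
    b = list(b1)
    r = [0.0, 0.0]                           # r_1^t, r_2^t
    z = [[0.0] * len(b1), [0.0] * len(b1)]   # Z_1^t, Z_2^t (meaningful once r_i > 0)
    for t, (x, category) in enumerate(samples, start=1):
        a, n2 = dot(b, x), dot(x, x)
        e = [0.0, 0.0]
        if category == 1 and a < 1:
            e[0] = (1 - a) / n2 + 1e-3 / t ** 3
        elif category == 2 and a > 1:
            e[1] = (a - 1) / n2 + 1e-3 / t ** 3
        for i in (0, 1):
            if e[i] > 0:
                w = e[i] / (r[i] + e[i])     # weight of the new vector in the mean
                z[i] = [(1 - w) * zi + w * xi for zi, xi in zip(z[i], x)]
                r[i] += e[i]
        b = [bi + (e[0] - e[1]) * xi for bi, xi in zip(b, x)]
    return b, r, z


def reconstruct(b1, r, z):
    """Evaluate B^1 + r_1 Z_1 - r_2 Z_2."""
    return [b + r[0] * z1 - r[1] * z2 for b, z1, z2 in zip(b1, z[0], z[1])]


# Hypothetical data for illustration only.
B1 = [3.0, 0.0]
SAMPLES = [([0.5, 0.4], 2), ([0.3, 3.0], 1), ([2.0, 2.0], 1), ([0.3, 0.6], 2)]
b, r, z = run_with_decomposition(B1, SAMPLES)
print(all(abs(u - v) < 1e-9 for u, v in zip(b, reconstruct(B1, r, z))))  # prints True
```

Because each $Z_i^t$ is a weighted mean, with positive weights summing to one, of vectors drawn from category $i$, it is a convex combination of points of $S_i$.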

Obviously, $Z_i^t \in S_i$. Also, since $r_i^t$ is positive and
non-decreasing, either $r_i^t \to r_i$, a finite limit, or else
$r_i^t \to \infty$. Consider each of the four cases separately:

Case A. Suppose $r_1^t \to r_1$ and $r_2^t \to r_2$. Since $|X|$ is
bounded, it follows that $Z_i^t \to Z_i$. Therefore,
$B^t \to B^1 + r_1 Z_1 - r_2 Z_2$.

Case B. Suppose $r_1^t \to r_1$ and $r_2^t \to \infty$. Since
$0 < h_1 < |Z_2^t|$, it follows that $r_2^t |Z_2^t| \to \infty$.
Therefore, for this case, $|B^t| \to \infty$.

Case C. Suppose $r_1^t \to \infty$ and $r_2^t \to r_2$. This is similar
to Case B.

Case D. Suppose $r_1^t \to \infty$ and $r_2^t \to \infty$. For this case,
$|B^t| \to \infty$, unless both the following conditions hold:

(1) $Z_1^t$ and $Z_2^t$ have the same direction in the limit and

(2) $\lim_{t \to \infty} |Z_1^t| / |Z_2^t| = \lim_{t \to \infty} r_2^t / r_1^t$.

These conditions are necessary for $|B^t|$ to converge to a finite value.
Suppose they hold. Let
$Z^* = \lim_{t \to \infty} Z_1^t / |Z_1^t| = \lim_{t \to \infty} Z_2^t / |Z_2^t|$.
Then,

    $\lim_{t \to \infty} B^t = B^1 + \lim_{t \to \infty} (r_1^t |Z_1^t| - r_2^t |Z_2^t|) Z^*$.

For simplicity, the term $B^1$ will now be dropped. It is assumed that
$B^1$ has been expressed as the difference of linear combinations of
finite sets of vectors of $S_1$ and $S_2$, respectively, and that these
combinations are included in $r_1^t Z_1^t$ and $r_2^t Z_2^t$,
respectively.

It remains to investigate the possibility of oscillation for this case
(D). Suppose the sequence $\{B^t\}$ contains a subsequence $\{B^{t_i}\}$
for which $|B^{t_i}|$ converges to a finite value. It then follows that
for $j = 1, 2$, $Z_j^{t_i} / |Z_j^{t_i}| \to Z^*$, where $Z^*$ is a unit
vector, $\lim r_1^{t_i} / r_2^{t_i} = \lim |Z_2^{t_i}| / |Z_1^{t_i}|$,
and $B^{t_i} \to B^L$, a finite vector. Suppose $B^t$ does not converge
to $B^L$. It shall be shown that this assumption leads to a contradiction
with probability one. Under this assumption, the sequence $\{B^{t_i}\}$
contains a subsequence, again written $\{B^{t_i}\}$, for which
$B^{t_i} \to B^L$ and $B^{t_i+1}$ does not converge to $B^L$. Since

    $Z_1^{t_i+1} = Z_1^{t_i} (1 - e_1^{t_i} / (r_1^{t_i} + e_1^{t_i})) + (e_1^{t_i} / (r_1^{t_i} + e_1^{t_i})) X^{t_i}$,

$Z_1^{t_i+1} / |Z_1^{t_i+1}|$ converges to $Z^*$. Similarly,
$Z_2^{t_i+1} / |Z_2^{t_i+1}| \to Z^*$. Also,

    $\lim r_1^{t_i+1} / r_2^{t_i+1} = \lim (r_1^{t_i} + e_1^{t_i}) / (r_2^{t_i} + e_2^{t_i}) = \lim |Z_2^{t_i}| / |Z_1^{t_i}|$.

Thus in the limit, the plane $B^{t_i+1} X = 1$ is perpendicular to the
line determined by the origin and $Z^*$ and its distance from the origin
oscillates finitely; the probability that this event will not occur is
one unless the points which are not oriented correctly with respect to
the plane $B^L X = 1$ all lie on the line determined by the origin and
$Z^*$. But in this case, it is impossible for both $r_1^t$ and $r_2^t$
to go to infinity.

From the analysis of the four cases above, it follows that either $B^t$
approaches a limit or else $|B^t| \to \infty$. It shall now be proved
that if $|B^t| \to \infty$, the plane $B^t X = 1$ converges to a finite
limit with probability one, if convergence is defined as follows:

Definition: Let the vector $B^t$ be written $B^t = c^t A^t$ where
$c^t > 0$ and $|A^t| = 1$; if as $c^t \to \infty$, $A^t$ converges to a
vector $A$, then the plane $B^t X = 1$ is said to converge to the plane
$A X = 0$.
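
The factorization used in the definition might be computed as in the following sketch, in which the function name `scale_and_direction` and the numerical values are hypothetical. It also prints $1/c$, the distance of the plane $B X = 1$ from the origin, to show the plane approaching the plane $A X = 0$ as $c$ grows with $A$ held fixed.

```python
import math


def scale_and_direction(b):
    """Write B = c * A with c = |B| > 0 and |A| = 1."""
    c = math.sqrt(sum(x * x for x in b))
    if c == 0:
        raise ValueError("B must be non-zero")
    return c, [x / c for x in b]


# As c grows with A fixed, the plane B X = 1 (at distance 1/c from the
# origin) approaches the plane A X = 0 through the origin.
for c in (1.0, 10.0, 100.0):
    b = [c * 0.6, c * 0.8]
    scale, direction = scale_and_direction(b)
    print(scale, direction, 1.0 / scale)
```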

The proof below requires the following lemma.

Lemma. If $X^t \in S_1$ and $B^t X^t < -1 - o(1/t^2)$, or if
$X^t \in S_2$ and $B^t X^t > 1$, then $|B^{t+1}| < |B^t|$, for
sufficiently large $t$.

Proof. Suppose $X^t \in S_1$ and $B^t X^t < -1$. In this case
$B^{t+1} = B^t + e_1^t X^t$. Thus

    $|B^{t+1}|^2 - |B^t|^2 = 2 e_1^t B^t X^t + (e_1^t)^2 |X^t|^2$.

Therefore, since $e_1^t = (1 - B^t X^t) / |X^t|^2 + o(1/t^2)$,
$|B^{t+1}| < |B^t|$ if $B^t X^t < -1 - |X^t|^2 o(1/t^2)$. Since $|X^t|$
is bounded, the first part of the lemma follows.

Now suppose $X^t \in S_2$ and $B^t X^t > 1$. Since
$B^{t+1} = B^t - e_2^t X^t$,

    $|B^{t+1}|^2 - |B^t|^2 = -2 e_2^t B^t X^t + (e_2^t)^2 |X^t|^2$.

Therefore, since $e_2^t = (B^t X^t - 1) / |X^t|^2 + o(1/t^2)$,
$|B^{t+1}| < |B^t|$ if $B^t X^t > -1 + |X^t|^2 o(1/t^2)$. This
inequality will hold if $B^t X^t > 1$ provided $t$ is sufficiently large.
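
The two inequalities in the proof can be checked numerically. Substituting the expression for $e_1^t$ gives $|B^{t+1}|^2 - |B^t|^2 = e_1^t (1 + B^t X^t + |X^t|^2 o(1/t^2))$, and likewise $e_2^t (-1 - B^t X^t + |X^t|^2 o(1/t^2))$ in the second case. The sketch below compares these closed forms with a direct computation; the function name `norm_change`, the vectors, and the value used for the $o(1/t^2)$ term are assumptions chosen for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def norm_change(b, x, category, o_term):
    """Return (direct, closed_form) values of |B^{t+1}|^2 - |B^t|^2.

    o_term stands for the o(1/t^2) term; its value here is an assumption.
    The vector x is assumed to be on the wrong side, so a correction is made.
    """
    a, n2 = dot(b, x), dot(x, x)
    if category == 1:
        e = (1 - a) / n2 + o_term
        new_b = [bi + e * xi for bi, xi in zip(b, x)]
        closed = e * (1 + a + n2 * o_term)
    else:
        e = (a - 1) / n2 + o_term
        new_b = [bi - e * xi for bi, xi in zip(b, x)]
        closed = e * (-1 - a + n2 * o_term)
    return dot(new_b, new_b) - dot(b, b), closed


# Hypothetical vectors: B X = -1.5 < -1 in the first case, B X = 3 > 1 in the second.
print(norm_change([-2.0, 0.5], [1.0, 1.0], 1, 1e-3))
print(norm_change([2.0, 1.0], [1.0, 1.0], 2, 1e-3))
```

In both hypothetical cases the change is negative, as the lemma requires. For a vector of $S_1$ with $-1 < B^t X^t < 1$ the same formula gives a positive change, which is why the lemma needs the stronger condition in the first case.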

Since the distance of the plane $B^t X = 1$ from the origin is equal to
$1 / |B^t|$, it suffices to show that if the plane does not converge to
a limit, no matter how close the plane gets to the origin, it will
eventually move away from the origin with probability 1. If a plane
which is correctly oriented with respect to some points in both $S_1$
and $S_2$ approaches the origin but does not converge to a limit, it
must, when it gets sufficiently close, intersect $S_1$, $S_2$, or both.
If it intersects $S_2$, since sampling is random, the probability is one
that eventually a vector from $S_2$ will be sampled, which in accordance
with the lemma, will result in the plane's moving away from the origin.
Suppose, on the other hand, that the plane intersects $S_1$; in this
case there are two possibilities:

(1) There exists no plane which passes through the origin and separates
$S_1$ and $S_2$.

This implies that when $t$ is sufficiently large, i.e., when the plane
is sufficiently close to the origin, there will be vectors $V$ of $S_1$
for which $B^t V < -1 - o(1/t^2)$; and when such a vector is sampled,
the plane will move away from the origin. Furthermore, the probability
that $|B^{t+1}| < |B^t|$, i.e., the probability of a random vector not
falling in the region between the parallel planes $B^t X = 1$ and
$B^t X = -1 - o(1/t^2)$, becomes arbitrarily close to one as
$|B^t| \to \infty$.

(2) There exist planes which pass through the origin and which separate
$S_1$ from $S_2$.

In this case, if $|B^t| \to \infty$, the sequence of planes $B^t X = 1$
converges to such a plane. This can be demonstrated as follows. When $t$
is sufficiently large no points of $S_2$ will lie between the plane
$B^t X = 1$ and a separating plane which passes through the origin,
i.e., $r_1^t \to \infty$ and $r_2^t \to r_2 < \infty$. Let $\{B^{t_i}\}$
be a subsequence of $\{B^t\}$ for which the sequence of planes
$B^{t_i} X = 1$ converges. Since

    $Z_1^{t_i+1} = Z_1^{t_i} (1 - e_1^{t_i} / (r_1^{t_i} + e_1^{t_i})) + (e_1^{t_i} / (r_1^{t_i} + e_1^{t_i})) X^{t_i}$,

it follows that the sequence of planes $B^{t_i+1} X = 1$ converges to
the limit of the sequence of planes $B^{t_i} X = 1$.

This completes the proof of the following theorem.

Theorem: If assumptions (1) through (3) hold, the plane $B^t X = 1$
converges to a limit with probability one.

5. Conclusions. It has been assumed that sampling is random from the
entire population of individuals to be classified. If stratified
sampling, i.e., by categories, is permitted, convergence may be made
more rapid. It may be feasible, for example, to alternate categories by
sampling from a given category as long as the vector sampled requires an
adjustment in the estimate of the dividing plane, and as soon as a
vector is obtained which is correctly oriented with respect to the
plane, switching to the alternative category. Furthermore, if any
information about the distribution of $X$ is available, it might be
practicable to incorporate it into the stratified sampling plan.
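
The alternating plan suggested above might look like the following sketch. It is offered only as an illustration: the function name `alternating_training`, the finite pools standing in for $S_1$ and $S_2$, the seed, and the stand-in $10^{-3}/t^3$ for the $o(1/t^2)$ term are all assumptions, and no claim is made here about how much faster such a plan converges.

```python
import random


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def alternating_training(b1, s1, s2, steps, seed=0):
    """Stratified sampling: stay in a category while corrections are needed.

    s1 and s2 are finite pools standing in for S_1 and S_2 (hypothetical data).
    Returns the final estimate and the number of corrections made.
    """
    rng = random.Random(seed)
    pools = {1: s1, 2: s2}
    b, category, corrections = list(b1), 1, 0
    for t in range(1, steps + 1):
        x = rng.choice(pools[category])
        a, n2 = dot(b, x), dot(x, x)
        o_term = 1e-3 / t ** 3            # stands in for o(1/t^2), an assumption
        if category == 1 and a < 1:
            e = (1 - a) / n2 + o_term
            b = [bi + e * xi for bi, xi in zip(b, x)]
            corrections += 1
        elif category == 2 and a > 1:
            e = (a - 1) / n2 + o_term
            b = [bi - e * xi for bi, xi in zip(b, x)]
            corrections += 1
        else:
            category = 3 - category       # no correction needed: switch category
    return b, corrections


# Hypothetical usage with one point in each pool.
b, n = alternating_training([0.0, 0.0], [(2.0, 2.0)], [(0.5, 0.4)], steps=10)
print(n)  # prints 1
```

In this small example a single correction already places both points on the proper sides of the plane, after which the procedure merely alternates between the two categories.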

It is of interest to determine whether this estimating procedure is
applicable to non-static situations which exhibit a shifting in the
characteristics of the categories with time. It is conjectured that the
answer is yes, provided that changes are sufficiently gradual and that
at any given time the assumptions above hold.